significance testing (NHST) with t-tests, analysis of variance, or some form
of the general linear model. This paper contends that, at least for program 
evaluation, the focus of NHST and its recommended alternatives misses the 
target of what evaluators really need to know about a program's success: (1)
how meaningful to the client the changes attributable to the program are; (2)
how many participants actually achieved these changes; and (3) how practical 
the program was. An approach is recommended that borrows from approaches used 
in the medical sciences to accumulate replicative evidence from repeated 
applications of the same program or from programs of a similar nature. This 
allows a defensible descriptive analysis of program effectiveness and 
efficiency. Taking a more descriptive approach means examining sustainable 
clinical improvement that results from the program. Using sustainable 
clinical improvement and the indicators of practical significance produces 
answers and makes statistical inference welcome, when warranted, but not
necessary.

New Indicators for Program Evaluation
John C. Hanes and Michael Hail 

Many program evaluations involve some type of statistical testing to verify that 
the program has, indeed, succeeded in accomplishing initially established goals. In many 
cases this takes the form of null hypothesis significance testing (NHST) with either
t-tests, analysis of variance (ANOVA), or some form of the general linear model (GLM).
Commonly utilized in the social, behavioral, and health sciences as well, such approaches 
have received increasing criticism from methodologists in their respective communities 
for many years. Whole volumes carry the debate (Harlow, Mulaik, & Steiger, 1997; 
Morrison & Henkel, 1970), and some journals devote entire issues to the topic (The
Journal of Experimental Education, 1993, Vol. 61, No. 4). Several classic articles
provide reasons for extreme caution when using statistical inference - see Carver (1978), 
Meehl (1978), and Cohen (1994). Incidentally, Carver and Meehl lament the continuing 
reliance on significance testing in more recent articles (Carver, 1993; Meehl, 1997). 

This paper contends that, at least for program evaluation, the focus of NHST and 
its recommended alternatives misses the target of what we really need to know about a 
program’s success. Finagle’s New Laws of Information state that (1) the information we
have is not what we want, (2) the information we want is not what we need, and (3) the
information we need is not available (Peers, 1978). Often, in program evaluation, we
have what we say we want, statistical significance testing, which is usually not what we 
need for several reasons. Less often, we have confidence intervals, power analyses,
effect sizes, and meta-analysis. While these move closer to what we need, they still fail
to tell us three things: how meaningful to the clients the changes that might be
attributable to the program were, how many program participants actually achieved these
changes relative to a comparison group, and how practical the program was. Fortunately,
this information is available. 

The various sciences form a rough hierarchy of research rigor with regard to the 
selection and assignment of subjects, the control of subjects and variables, the type and 
range of treatments, the precision and accuracy of instrumentation, and the enforceability 
of protocols. Generally, the physical sciences rank above the biological, medical, and
health sciences, which, in turn, have somewhat more credibility in these regards than the 
behavioral and social sciences. Evaluation occupies a place among the latter group. 

When analyzing evidence for the effect of an experimental manipulation in the 
physical sciences, a mathematically predictive theory, strong instrumentation, and 
extensive, varied, external replication (a cumulative process) make the argument without 
need of statistical significance testing (Meehl, 1978). “Replicated results automatically 
make statistical significance unnecessary” (Carver, 1978). The medical sciences look to 
the randomized controlled trial, or RCT (The Standards of Reporting Trials Group,
1994), with a large sample size (Moore, Gavaghan, Tramer, Collins, & McQuay, 1998)
as the gold standard for determining treatment effects, and this provides the foundation 
for various tests in terms of statistical significance, confidence intervals, and clinical 
efficacy. The behavioral and social sciences seemingly seek to emulate the medical 
sciences without, for the most part, the imprimatur of the RCT and sufficiently large 
sample sizes. When the social sciences occasionally imitate the physical sciences by 
utilizing replication, they often employ internal replication in the form of
cross-validation, jackknife, or bootstrap procedures. This dependence on a single sample tends
to inflate estimates of replicability (Thompson, 1993). 

For program evaluation, we propose borrowing several approaches from the 
medical sciences while dropping any unjustified inferential associations and using these 
approaches as a springboard for accumulating replicative evidence from repeated 
applications of the same program or from programs of a similar nature, much like the 
physical sciences do. Utilizing this combination allows a defensible descriptive analysis 
of program effectiveness and efficiency, both individually and in a comparative mode; 
our evaluation moves to a more exploratory orientation accompanied by numerical, 
counting, and graphical detective work (Tukey & Tukey, 1988). 

Statistical Significance Testing 

In program evaluation, the warrant for using statistical significance testing often 
breaks down on basic assumptions underlying the methodology. Many times a program 
lacks both random selection and random assignment. Without one or the other, 
comparison of test statistics with a reference distribution carries little meaning because 
the theoretical mathematical curve of the reference distribution is generated with the 
assumption of random sampling (Shaver, 1993). “...accurate significance testing
requires randomization (random sampling or assignment) to be interpretable” (Biskin, 
1998). 

Behavioral science, social science, and program evaluation data often fail to meet 
assumptions of independence, equal variance (homoscedasticity), and normal distribution 
for the error term of the dependent variable(s). As Stevens (1996) points out with 
reference to ANOVA and MANOVA, violation of the independence assumption has 
serious consequences and occurs quite often. If interventions involve interactions among 
the individuals receiving treatment, then independence is at risk, and random sampling or 
random assignment does not solve the problem. Abuse of the independence assumption 
in the social sciences usually involves a misunderstanding about the unit of analysis, 
occurs frequently, and often escapes the scrutiny of a field’s top journal editors (Hykle,
Stevens, & Markle, 1993).

While re-expressions or transformations may help to some extent for unequal 
variances and non-normality, violations of these assumptions dictate caution in 
interpretation of the test results. Certain statistical tests exhibit robustness in the face of 
assumption violations, but discerning the actual intervention or treatment effects becomes 
increasingly difficult as the experimental design advances in complexity (Biskin, 1998). 

Recognized variables sometimes suffer poor quality and reliability in their 
application, and a plethora of additional variables, some spurious and some of 
consequence, either make an unnecessary appearance in the research model or fail to 
receive adequately measured attention during program administration. Such 
measurement and specification errors can produce disastrous results for the evaluation 
(Pedhazur, 1997), and, as in the case of the independence assumption, even elite journals 
overlook the absence of important measurement components in their published studies 
(Whittington, 1998). 

Threats to internal and external validity may directly or obliquely affect statistical
significance, and program evaluation provides a rich target for such problems.
Systematic error or bias may enter the evaluation via such threats, and larger sample
sizes do not remedy it (Cochran, 1983). Internally, the issues of history,
maturation, testing, instrumentation, statistical regression, selection, mortality, and 
various interactions of these variables dictate careful attention to a program evaluation’s 
design (Campbell & Stanley, 1966). Likewise, interaction effects of selection biases and 
the dependent variable(s), the reactive effects of both pre-testing and the treatment 
setting, and the effects due to multiple treatments limit the generalizability of the program 
and may alter statistical testing. Note that the American Psychological Association’s 
Task Force on Statistical Inference gives an interesting overview of the major areas for 
concern in any behavioral or social science research endeavor (Wilkinson, 1999). Even 
in the RCTs of clinical medicine, only rigorous attention to the details of design and
methodology can avoid bias and the associated distortions of statistical results (Schulz, 
Chalmers, Hayes, & Altman, 1995). 

Even if a particular program’s design and methodology meet the above
challenges, the issue of the meaning of a statistical significance test remains. 
Misinterpretations abound. To reiterate the corrections: the p-value is not the probability 
that the null hypothesis is true (Cohen, 1994); the p-value does not generally address 
replicability - see Abelson’s (1995) discussion; and rejection of the null hypothesis does 
not affirm the theory being tested (Cohen, 1994). In the latter case, the null hypothesis of 
no difference between two populations invariably fails because, “...in the social sciences 
everything correlates with almost everything else” (Meehl, 1997), and a sufficiently large 
sample size will tease out the difference when couched in a statistical significance test 
(Thompson, 1993). The calculated p-value actually and merely expresses the probability 
(0 to 1.0) of observing a value of a particular test statistic as extreme (either as large or as 
small) or more extreme than the one observed, “given the sample size, and assuming that 
the sample was derived from a population in which the null hypothesis (H0) is exactly
true” (Thompson, 1996). 
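
The dependence of the p-value on sample size can be illustrated with a minimal sketch. It assumes a two-sided z-test for two equal-sized groups with a known, common standard deviation; the function name and the figures (a two-pound difference, a 20-pound standard deviation) are hypothetical and serve only as an illustration.

```python
import math


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, assuming a known common standard deviation (z-test sketch)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Probability of a statistic at least this extreme if the null is exactly true
    return math.erfc(abs(z) / math.sqrt(2.0))


# The same hypothetical two-pound difference at three sample sizes
for n in (25, 400, 10000):
    print(n, two_sample_z_p_value(2.0, 20.0, n))
```

The observed difference never changes in this sketch; only the sample size does, and the p-value shrinks as the groups grow.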

Most importantly, the p-value and the related NHST tell nothing about the value, 
magnitude, or importance of the substantive result. In program evaluation a statistically 
significant outcome may have no valid meaning for the client population who perceive 
little change in their status despite a reported positive outcome on some measure of 
knowledge, attitude, or behavior. 

NHST does receive a proper defense from Chow (1996), Abelson
(1997), and Mulaik, Raju, and Harshman (1997). Chow, in particular, places NHST in 
the appropriate experimental and logical context, contexts that are so often lacking in 
program evaluation. 



Other Approaches 

Many of the NHST critics referenced above advocate a variety of alternatives 
including point estimation with confidence intervals, power analysis, effect sizes, and 
meta-analysis. These all offer an improvement on NHST, but they also fail to provide 
information about the absolute importance of the treatment effect or how many subjects 
actually reached any targeted improvement. 

Confidence intervals present a range of values expected to contain a population
parameter with a stated level of confidence, and they link closely with statistical
significance testing (Gardner & Altman, 1986). Because of this link, they carry the same 
requirements for randomization and assumption satisfaction as NHST; they also fail “to
settle issues raised about the inductive conclusion validity” for a study (Chow, 1996). 

Statistical power, the ability to detect a particular effect size difference due to an 
intervention via statistical significance testing, also ties itself to the NHST requirements. 
Greater power or sensitivity derives from manipulations of the sample size, alpha level, 
statistical test, and effect size. However, Shaver (1993) wonders “What is the purpose of 
power analysis and the arbitrary manipulation of criteria in order to help ensure that the 
researcher will obtain a desired level of probability, when statistical significance has so 
little meaning?” 

Effect sizes provide metrics that are independent of the sample size and can also 
be independent of the scale of measurement. “For a given dependent measure, effect size 
can be thought of simply as the difference between the means of the experimental versus 
control populations” (Lipsey, 1998). However, this absolute effect size has dependence 
on the scale of measurement. Standardizing the difference between the means allows the 
effect size to escape this dependency. The proportion of variance in the dependent 
variable that is explained by the independent variable offers another way to express effect 
size (Kellow, 1998). Various design and analysis decisions will have important 
consequences for the effect size observed (Posavac, 1998). 
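
A minimal sketch of the two forms of effect size just described may help. It assumes the standardized form divides the mean difference by a pooled standard deviation, which is one common choice and not necessarily the one each cited author uses; the function names and the sample figures are hypothetical.

```python
import math
import statistics


def raw_effect(treatment, control):
    """Absolute effect size: difference between group means (scale dependent)."""
    return statistics.mean(treatment) - statistics.mean(control)


def standardized_effect(treatment, control):
    """Mean difference divided by the pooled standard deviation (scale free)."""
    n_t, n_c = len(treatment), len(control)
    var_t = statistics.variance(treatment)
    var_c = statistics.variance(control)
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return raw_effect(treatment, control) / pooled_sd


# Hypothetical weight losses in pounds, then the same data in kilograms
loss_lb_treated = [10, 12, 14, 16]
loss_lb_control = [8, 10, 12, 14]
loss_kg_treated = [x * 0.4536 for x in loss_lb_treated]
loss_kg_control = [x * 0.4536 for x in loss_lb_control]

print(raw_effect(loss_lb_treated, loss_lb_control))
print(raw_effect(loss_kg_treated, loss_kg_control))
print(standardized_effect(loss_lb_treated, loss_lb_control))
print(standardized_effect(loss_kg_treated, loss_kg_control))
```

Changing the unit of measurement changes the raw difference but leaves the standardized value essentially unchanged.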

Unfortunately, the effect size, like NHST, can lead to misinterpretation. A rather 
arbitrary understanding of small, medium, and large effect sizes has developed with .2, .5,
and .8 representing these values, respectively (Shaver, 1993). In many situations, 
relatively small effects may actually carry strong value, especially in cases “where there 
are processes by which individually tiny influences cumulate to produce meaningful 
outcomes” (Abelson, 1985), or in cases where either the independent variable undergoes
only minimal manipulation or the dependent variable is difficult to influence (Prentice & 
Miller, 1992). At the opposite extreme, a large effect size does not necessarily translate 
into a meaningful outcome for the subjects of a program intervention (Jacobson & Truax, 
1991; Lipsey & Wilson, 1993). 

Effect sizes do enable the production of numerical averages for the combination 
of many studies through meta-analysis (Chow, 1988). Meta-analysis springs from 
appropriate effect size measures and their compatibility across studies, thus aggregating 
the strength of many evaluations. Meta-analysis has its own set of problems. Some arise 
from the reporting quality of the research database (Orwin & Cordray, 1985), particularly 
in regard to publication bias or the “file-drawer problem” where a number of studies, 
often those lacking statistical significance or sufficiently large effect size, cannot be 
located for inclusion in the analysis (Givens, Smith, & Tweedie, 1997). Others stem 
from a lack of independence across similar studies, confusion about the appropriate unit 
of analysis, confounding problems with weighted means, and misuse of tests of 
heterogeneity (Hall & Rosenthal, 1995). Rarely, meta-analyses relying on sets of small 
studies hold the potential to mislead practitioners, and a large RCT may be required to 
clarify understanding (Egger & Smith, 1995). Nevertheless, if employed with due 
diligence, meta-analysis reinforces the importance of replication for evaluative judgment.

Sustainable Clinical Improvement 

Taking a more descriptive tack motivates several questions. What do we really 
need to know about the outcome of a program, and how can we report what we need to 
know given the available data? Statistical significance and effect sizes may or may not
carry meaning for the clients in a particular program evaluation. This depends on what is 
important to the clients (with input from others) and whether an obtained valuable result 
continues beyond an immediate time frame as defined by the clients and evaluators. 
Clinical significance defines this valuable result or the minimal important difference 
(Guyatt, Juniper, Walter, Griffith, & Goldstein, 1998); in other words, clinical 
significance represents a treatment or intervention’s “ability to meet standards of efficacy 
set by consumers, clinicians, and researchers” (Jacobson & Truax, 1991). This is the 
overriding substantive finding (Pedhazur, 1997), a finding whose definition rests 
increasingly with the patient according to some in the medical sciences (Guyatt et al., 
1998). 

Sustainable clinical improvement refers to client change that stabilizes at or above 
a previously agreed upon level of success for a previously agreed upon period of time. 
Some notion of eventual temporal stability should accompany the achievement of a 
targeted goal because an episodic distribution of success over time leaves a fleeting sense 
of accomplishment for many clients and because major fluctuations in benefit lead to 
underestimation or overestimation of the measure of efficacy (Laupacis, Sackett, & 
Roberts, 1988). Thus clients and evaluators must make two decisions up front: what
level on a particular indicator represents a successful program outcome for a client, and
how long, within reason, this level must be maintained before that client can be declared
durably improved. Thompson (1993) argues the need for these types of decisions
and notes that “Statistics can be employed to evaluate the probability of an event. But 
importance is a question of human values, and math cannot be employed as an atavistic 
escape (a la Fromm’s Escape from Freedom) from the existential human responsibility
for making value judgements.” The determination of value requires reasonable 
judgement. 
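
The two up-front decisions can be expressed as a simple check on each client's follow-up record. This is a minimal sketch under one assumed reading of the definition: a client counts as sustainably improved once the indicator stays at or above the agreed level for the agreed number of consecutive follow-up measurements. The function name, the parameter names, and the use of evenly spaced measurements are all hypothetical.

```python
def reached_sustainable_improvement(follow_up_scores, success_level, required_periods):
    """Return True if scores stay at or above `success_level` for at least
    `required_periods` consecutive follow-up measurements (assumed reading)."""
    run = 0
    for score in follow_up_scores:
        # Extend the current run of successes, or reset it on a lapse
        run = run + 1 if score >= success_level else 0
        if run >= required_periods:
            return True
    return False


# Hypothetical pounds lost at successive follow-ups, with a 30-pound target
steady_client = [31, 33, 30, 35]
episodic_client = [31, 29, 33, 28, 35]
print(reached_sustainable_improvement(steady_client, 30, 3))
print(reached_sustainable_improvement(episodic_client, 30, 2))
```

Under this reading the episodic client never holds the target for two measurements in a row, so the fleeting successes do not count.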

The approach recommended here means that the criterion for value receives 
commitment before the program has started. Others follow a post hoc, statistical line of 
argument and allow the distributions of the treatment and comparison (control if truly 
experimental) groups to dictate the demarcation of program effectiveness. In addressing 
the issue of “what effect size is worth detecting?” Lipsey (1998) proposes Cohen’s U3 
measure (Cohen, 1977) and the binomial effect size display (BESD) of Rosenthal and
Rubin (1982) as possible candidates to answer the question. Cohen’s
U3 sets the mean of the control group as the success threshold while the BESD utilizes 
the grand median for the conjoint control and treatment distribution in the same role.
Both assume normal distributions for either the control group or the control and treatment
groups, respectively. With the normality assumption in place, Jacobson and Truax
(1991) develop a reliable change index utilizing the standard error of
measurement to help determine if “real change” occurs. Although these methods have 
increasing value as a program nears the nature of a true experiment, we prefer controlling 
the criteria with reference to client, rather than statistical, input. 

With sustainable clinical improvement established, the medical sciences provide a 
measure for quantifying a program’s efficacy, the number needed to treat (NNT).
Defined as the reciprocal of the absolute risk reduction (Laupacis et al., 1988), the NNT
offers easy computation (Fig. 1) if the evaluator knows the total number of clients in the
treatment (or active) and comparison groups and also the number who reached
sustainable clinical improvement in each group. This information should always be
available. Where a comparison group cannot be employed, the ratio formed for the 
comparison group (Ic/Tc) may be estimated from past experience or from similar 
populations. Note that the denominator in the NNT equation, the absolute risk reduction, 
is simply the difference between the event rate in the treatment group and the event rate 
in the comparison group, and this is a different type of expression for effect size (Hall & 
Rosenthal, 1995). An evaluator might take the approach of determining a critical effect 
size and asking if the mean of the treatment group exceeded this value when the 
comparison group, on average, did not. With the NNT, an evaluator asks how many more
of those in the treatment group, as a proportion, exceeded the critical effect size (defined by
sustainable clinical improvement) than did those in the comparison group, as a
proportion.
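
From the verbal definition, the calculation can be written as NNT = 1 / (Ia/Ta - Ic/Tc). Here Ic/Tc is the comparison-group ratio named in the text, and Ia/Ta is an assumed parallel notation for the treatment (or active) group. The sketch below follows that definition; the function names and the counts are hypothetical, and returning None for equal event rates is an illustrative choice for the undefined case.

```python
def event_rate(improved, total):
    """Proportion of a group reaching sustainable clinical improvement."""
    if total <= 0 or not 0 <= improved <= total:
        raise ValueError("need 0 <= improved <= total and total > 0")
    return improved / total


def absolute_risk_reduction(improved_t, total_t, improved_c, total_c):
    """Treatment event rate minus comparison event rate."""
    return event_rate(improved_t, total_t) - event_rate(improved_c, total_c)


def number_needed_to_treat(improved_t, total_t, improved_c, total_c):
    """Reciprocal of the absolute risk reduction; None when it is undefined."""
    arr = absolute_risk_reduction(improved_t, total_t, improved_c, total_c)
    # Compare cross-products so equal proportions are detected exactly
    if improved_t * total_c == improved_c * total_t:
        return None
    return 1.0 / arr


# Hypothetical counts: 40 of 100 treated and 15 of 100 comparison clients improved
print(number_needed_to_treat(40, 100, 15, 100))
```

With these hypothetical counts the event rates are .40 and .15, the difference is .25, and the NNT is 4.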

The lower the value for the NNT, the more efficacious is the program as 
measured on a particular indicator or dependent variable. A value of five means that five
clients must receive the program’s intervention or treatment for one additional client to
reach sustainable clinical improvement beyond what the comparison group achieves.
Obviously, a value of one indicates a perfect score: every treated client reaches sustainable
clinical improvement and no comparison client does. Negative values favor the
comparison group over the treatment group, and the program is actually impeding 
improvement in such cases. When the treatment and comparison groups have the same 
proportionate results, the NNT calculation involves division by zero yielding an 
undefined value, and this alerts the evaluator to the absence of efficacy. Very large 
values for the NNT make clear just how many clients must be served to achieve a single 
success, and the simplicity of the NNT promotes easy comparisons across different 
programs and for a single program with multiple applications over time. 

Because the NNT comes from the experimentally oriented health science arena, 
the calculation of confidence intervals accompanies the statistic (Cook & Sackett, 1995). 
This is always desirable if the proper statistical warrant exists, and the intervals help to 
protect against embracing a strong, but not statistically significant, NNT that was 
generated from a small sample (Ware, Mosteller, Delgado, Donnelly, & Ingelfinger, 
1992). 
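
Where the statistical warrant exists, one common large-sample way to obtain such an interval is to compute a normal-approximation interval for the absolute risk reduction and take reciprocals of its limits. The sketch below assumes that approach; whether it matches the cited authors' exact procedure is an assumption, and the function names and counts are hypothetical.

```python
import math


def arr_confidence_interval(improved_t, total_t, improved_c, total_c, z=1.96):
    """Normal-approximation interval for the absolute risk reduction."""
    p_t = improved_t / total_t
    p_c = improved_c / total_c
    arr = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / total_t + p_c * (1 - p_c) / total_c)
    return arr - z * se, arr + z * se


def nnt_confidence_interval(improved_t, total_t, improved_c, total_c, z=1.96):
    """Reciprocals of the ARR limits, or None when the ARR interval includes
    zero (the NNT interval is then not a single continuous range)."""
    low, high = arr_confidence_interval(improved_t, total_t, improved_c, total_c, z)
    if low <= 0 <= high:
        return None
    return tuple(sorted((1.0 / low, 1.0 / high)))


# Hypothetical large and small programs with similar observed event rates
print(nnt_confidence_interval(40, 100, 15, 100))
print(nnt_confidence_interval(4, 10, 1, 10))
```

The small program in this sketch has a favorable observed NNT, yet its interval for the risk reduction includes zero, which is the situation the intervals are meant to flag.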

Using the simple head counting of the NNT avoids the parametric problems of 
means based statistics. The influence of outliers and the effects of averaging may make it 
seem that program efficacy has been achieved when, in terms of the number of clients 
actually achieving improvement, this is not the case. Often these numbers become lost in 
the good news that averaging can afford as Bracey (1999) demonstrates in the field of 
education. 



Practical Significance 

Many authors use terms such as “practical significance” and “clinical
significance” synonymously (Kirk, 1996; Lipsey & Wilson, 1993; Rosenthal, 1990). We 
propose distinguishing these terms by connecting practical significance to the efficiency 
of a program with reference to such variables as time, cost, contact hours, client 
satisfaction, and adverse effects. Clinical significance, of course, relates to a program’s 
efficacy as discussed above. 

The median time to reach sustainable clinical improvement for those clients 
achieving this goal provides one measure of practical significance. Likewise, the mean 
or median cost per program participant yields another such measure; contact hours are
similarly computed. A Likert scale for overall client satisfaction, constructed with a 
rating of one being best, might render a median client rating as an indicator of efficiency. 
In a similar manner, evaluators might rate adverse effects and derive a median value. 

Each of the above five measures has the property that the lower the value, the 
more efficient the program on that particular indicator. This generates ranking order for 
ordinal scales and relative values for interval or ratio comparisons across programs, and 
such measures as these tell us what we need to know about the practicality of programs. 
Combined indicators of efficacy and efficiency result from the products of the practicality 
measures and the NNT. Again, because both the practicality measures and the NNT have 
a lower/better connotation, the combined indicators express ranked values. While these 
types of indicators, along with the NNT itself, offer means for comparing programs of a 
similar nature or multiple applications of the same program over time or location, 
comparison of programs with different agendas would require additional factors for 
consideration, such as relevance, prevalence, severity, criticality, necessity, participation 
rates, etc. 
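
A combined indicator of this kind can be sketched as the product of the NNT and one practicality measure, here the median cost per participant. The program names, costs, and NNT values below are hypothetical, and rejecting undefined or negative NNT values is an illustrative choice made so that the lower/better reading of the product holds.

```python
import statistics


def combined_indicator(nnt, practicality_value):
    """Product of the NNT and a lower-is-better practicality measure."""
    if nnt is None or nnt <= 0:
        raise ValueError("combined indicator needs a positive, defined NNT")
    return nnt * practicality_value


def rank_programs(programs):
    """Rank programs from best (lowest product) to worst (highest product)."""
    scores = {
        name: combined_indicator(data["nnt"], statistics.median(data["costs"]))
        for name, data in programs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical programs: NNT and cost per participant in dollars
example_programs = {
    "Program A": {"nnt": 4.0, "costs": [900, 1100, 1000]},
    "Program B": {"nnt": 5.0, "costs": [600, 700, 650]},
}
print(rank_programs(example_programs))
```

With a cost measure, the product can be read roughly as the expenditure associated with one additional success, so a program with a weaker NNT may still rank ahead if it is cheap enough per participant.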



Display of Information 

An easy display of the above information for comparative purposes involves 
tabling the data with programs on the rows and indicators on the columns. The column 
order might lead with NNT and number of clients (or confidence intervals for the NNT if 
warranted) followed by the practicality indicators and then the combined indicators. A 
final column could contain some aggregate index of the combined indicators. Following 
Ehrenberg’s (Bailar & Mosteller, 1988; Ehrenberg, 1977) rules gives such a table an 
organized and helpful arrangement. 

Medical science offers another way to present the data in the form of a L’Abbe
plot (L'Abbe, Detsky, & O'Rourke, 1987), designed to display the results of a
meta-analysis in terms of outcome rates (Fig. 2). The event rate in the treatment group is
plotted on the vertical axis while the event rate in the comparison group is plotted on the 
horizontal axis (these are the denominator terms in the NNT equation). Each point on the 
plot represents a single program or application of a program. A forty-five degree angle 
dotted line of equality separates the plot into two halves. If a point falls within the upper 
left half of the plot, then the treatment group fared better than the comparison group (a 
positive NNT). The exact position of the point tells how much better the treatment group 
did relative to the comparison group. A point in the lower right half of the plot represents 
a program where the comparison group bested the treatment group (a negative NNT). A
program with equal event rates in both groups generates a point on the line of equality
(an undefined NNT).
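
The placement rules for a point on the plot can be sketched without any plotting library. The function below returns the coordinates described above together with the region in which the point falls; the function name, the region labels, and the counts are hypothetical.

```python
def labbe_point(improved_t, total_t, improved_c, total_c):
    """Coordinates and region for one program on a L'Abbe plot."""
    x = improved_c / total_c  # comparison event rate, horizontal axis
    y = improved_t / total_t  # treatment event rate, vertical axis
    # Compare cross-products so points on the line of equality are exact
    left = improved_t * total_c
    right = improved_c * total_t
    if left > right:
        region = "treatment_better"   # upper left half, positive NNT
    elif left < right:
        region = "comparison_better"  # lower right half, negative NNT
    else:
        region = "on_line"            # line of equality, undefined NNT
    return x, y, region


# Hypothetical programs, each given as (improved_t, total_t, improved_c, total_c)
for counts in [(40, 100, 15, 100), (10, 100, 30, 100), (10, 50, 20, 100)]:
    print(labbe_point(*counts))
```

The vertical distance of a point from the line of equality equals the absolute risk reduction, so points far above the line correspond to small positive NNT values.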

Following the example of the Bandolier web site (www.ir2.ox.ac.uk/bandolier/),
we indicate the size of the program, in terms of the number of participating clients, via 
the size of a program’s point or circle on the plot. In the absence of confidence intervals 
(always use confidence intervals if warranted), a program’s sample size offers some 
relative indicator of confidence about the NNT when combined with a rigorous appraisal 
of design and methodological strength. Such an appraisal might be represented by 
assigning programs to various categories and coding these in color as shown in figure 3. 
That is, categories A through C reflect declining quality in program selection, control,
and application. 

Color-coding has other uses, such as indicating the strength of treatment, value 
level for a practicality measure, subgroup assignment, year of program initiation, and 
level of attainment for the treatment group. In the context of the weight loss program 
example below, figures 4 and 5 illustrate some of these applications. Color-coding 
reinforces the impact of the L’Abbe plot by including important study characteristics with 
event rate effect size reporting and confidence intervals (or sample sizes) as suggested by 
Light, Singer, & Willett (1994). 

Discussion 

Jacobson and Truax (1991) use the example of a treatment for obesity to illustrate 
the difference between effect size and clinical significance. Suppose a
quasi-experimental weight loss program for people averaging 300 lbs. showed that a
statistically significant reduction (p=.05) had occurred with the average decline being two 
pounds. A 95% confidence interval about this two pound loss had a lower limit of one 
pound and an upper limit of three pounds. Power analysis indicated that the evaluation 
could have detected a half-pound weight difference with alpha set at .05 and a power of 
.90. Suppose further that a similar program, where the subjects had a standard deviation 
of five pounds, produced an average weight loss of four pounds for a relatively large 
effect size of .80. Also, a meta-analysis confirmed that this and similar programs yielded 
an average effect size of .65 with proportionate weight loss numbers. 
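
As a worked check on the second program's figures, and assuming the effect size in question is the mean loss divided by the standard deviation (the symbols below are introduced only for illustration), the arithmetic is:

```latex
d = \frac{\bar{x}_{\text{loss}}}{s} = \frac{4\ \text{lb}}{5\ \text{lb}} = 0.80
```

The result carries no units, which is why it cannot by itself say whether four pounds matters to a 300 lb. person.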

How much meaningful value did the above programs render to their average 
client? Is a two pound or four pound weight loss really substantive to a 300 lb. person? 
Even if the mean weight loss were ten to twenty pounds, was this result due to a few 
clients with massive losses or many people with average losses? Did most clients 
maintain their loss for a reasonable period of time following cessation of intervention? 
How practical were these programs in relation to their efficacy? How do they compare in 
efficacy to other programs? 

Utilizing sustainable clinical improvement and the indicators of practical 
significance produces answers to these questions in a consistent and straightforward 
manner where statistical inference is welcome, when warranted, but not necessary. For 
instance, if clients and evaluators defined sustainable clinical improvement as the loss of 
at least thirty pounds maintained for one year beyond cessation of the program’s 
intervention, then an NNT of four would indicate that four people must receive treatment for
one additional person to achieve success. The practicality measures and the combined indicators
address efficiency/efficacy questions. L’Abbe plots provide the means for fairly easy 
interpretation of the results for subgroups in the same program, across replications of the 
same program, or in comparison to other programs. Figures 3 through 5 show some 
possible plots for such a weight loss program. 

Shaver (1993) contends that “The question of interest is whether an effect size of
a magnitude judged to be important has been consistently obtained across replications of 
adequate fidelity, not whether the result from a replication was statistically significant or 
whether the design had adequate power for a result to be statistically significant.” Using 
sustainable clinical improvement forces the explicit definition of “a magnitude judged to 
be important” or of what Mosteller called the interocular difference (Scriven, 1993), the 
difference that hits us between the eyes. The NNT tells us how many attained such a
difference in reference to what would have happened without a program’s intervention. 
The practicality measures and combined indicators explicate the cost-benefit structure of 
a program; L’Abbe plots help us compare “replications of adequate fidelity.” 